Python to R Translation of tokenizer count

November 16, 2023
You said:
Role: a programmer proficient in Python and R. Task: Translate code from Python to R. When something is not possible, ask the user how to proceed. Please translate the code below from Python to R:

python
def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model in {
        "gpt-3.5-turbo-0613",
        "gpt-3.5-turbo-16k-0613",
        "gpt-4-0314",
        "gpt-4-32k-0314",
        "gpt-4-0613",
        "gpt-4-32k-0613",
    }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows {role/name}\n{content}\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif "gpt-3.5-turbo" in model:
        print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613")
    elif "gpt-4" in model:
        print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
        return num_tokens_from_messages(messages, model="gpt-4-0613")
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with assistant
    return num_tokens
ChatGPT said:

To translate this Python code to R, we need to consider that R handles functions, exceptions, and data structures differently. Here's the translated code:

r
# Assumes the Python tiktoken library is available via reticulate:
# library(reticulate); tiktoken <- import("tiktoken")
num_tokens_from_messages <- function(messages, model = "gpt-3.5-turbo-0613") {
  # Try to get the encoding for the model; fall back to cl100k_base
  encoding <- tryCatch({
    tiktoken$encoding_for_model(model)
  }, error = function(e) {
    message("Warning: model not found. Using cl100k_base encoding.")
    tiktoken$get_encoding("cl100k_base")
  })
  tokens_per_message <- 0
  tokens_per_name <- 0
  if (model %in% c("gpt-3.5-turbo-0613", "gpt-3.5-turbo-16k-0613",
                   "gpt-4-0314", "gpt-4-32k-0314",
                   "gpt-4-0613", "gpt-4-32k-0613")) {
    tokens_per_message <- 3
    tokens_per_name <- 1
  } else if (model == "gpt-3.5-turbo-0301") {
    tokens_per_message <- 4  # every message follows {role/name}\n{content}\n
    tokens_per_name <- -1    # if there's a name, the role is omitted
  } else if (grepl("gpt-3.5-turbo", model)) {
    message("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
    return(num_tokens_from_messages(messages, model = "gpt-3.5-turbo-0613"))
  } else if (grepl("gpt-4", model)) {
    message("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
    return(num_tokens_from_messages(messages, model = "gpt-4-0613"))
  } else {
    stop(sprintf("num_tokens_from_messages() is not implemented for model %s. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.", model))
  }
  num_tokens <- 0
  for (msg in messages) {  # 'msg' avoids masking base R's message()
    num_tokens <- num_tokens + tokens_per_message
    for (key in names(msg)) {
      value <- msg[[key]]
      num_tokens <- num_tokens + length(encoding$encode(value))
      if (key == "name") {
        num_tokens <- num_tokens + tokens_per_name
      }
    }
  }
  num_tokens <- num_tokens + 3  # every reply is primed with assistant
  return(num_tokens)
}

Some notes on the translation:

  1. R does not have a direct equivalent of Python's try-except. Instead, I used tryCatch which works similarly.
  2. Instead of dict.items() in Python, I used names() to get the keys and then accessed values using [[key]] in R.
  3. R doesn't have a += operator, so accumulation is written as num_tokens <- num_tokens + ....
  4. R handles strings and error messages a bit differently, so I used sprintf for string formatting and message for warnings.

Please let me know if there's anything specific in the code you'd like to alter or if there's functionality you'd like to add.

You said:
There is no tiktoken for R available to my knowledge. Instead, please use the convention that every individual word is approximately 4/3 tokens.
ChatGPT said:

Understood. In the absence of a direct R equivalent for the tiktoken library, we'll use a simplified estimation where each individual word is approximately 4/3 tokens. Here's the revised R code:

r
num_tokens_from_messages <- function(messages, model = "gpt-3.5-turbo-0613") {
  tokens_per_message <- 0
  tokens_per_name <- 0
  if (model %in% c("gpt-3.5-turbo-0613", "gpt-3.5-turbo-16k-0613",
                   "gpt-4-0314", "gpt-4-32k-0314",
                   "gpt-4-0613", "gpt-4-32k-0613")) {
    tokens_per_message <- 3
    tokens_per_name <- 1
  } else if (model == "gpt-3.5-turbo-0301") {
    tokens_per_message <- 4
    tokens_per_name <- -1
  } else if (grepl("gpt-3.5-turbo", model)) {
    message("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
    return(num_tokens_from_messages(messages, model = "gpt-3.5-turbo-0613"))
  } else if (grepl("gpt-4", model)) {
    message("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
    return(num_tokens_from_messages(messages, model = "gpt-4-0613"))
  } else {
    stop(sprintf("num_tokens_from_messages() is not implemented for model %s. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.", model))
  }
  num_tokens <- 0
  for (msg in messages) {  # 'msg' avoids masking base R's message()
    num_tokens <- num_tokens + tokens_per_message
    for (key in names(msg)) {
      value <- msg[[key]]
      # Estimate tokens for each word, assuming each word is 4/3 tokens
      num_tokens <- num_tokens + (length(strsplit(value, " ")[[1]]) * 4 / 3)
      if (key == "name") {
        num_tokens <- num_tokens + tokens_per_name
      }
    }
  }
  num_tokens <- num_tokens + 3  # every reply is primed with assistant
  return(num_tokens)
}

In this version:

  1. The tiktoken related code is removed.
  2. The num_tokens calculation for each value in the message is done by splitting the value into words and then multiplying the word count by 4/3.
  3. Other structure and flow remain the same as in the previous version.
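The heuristic itself is easy to sanity-check. The same 4/3 words-to-tokens rule in Python, for comparison:

```python
def estimate_tokens(text):
    # Approximate tokens as word count * 4/3
    return len(text.split()) * 4 / 3

print(estimate_tokens("one two three"))  # 3 words -> 4.0
```

One caveat: Python's split() with no argument collapses runs of whitespace, whereas strsplit(value, " ") in R produces empty strings for repeated spaces, inflating the word count on irregularly spaced input.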

Please let me know if you need further adjustments or if there's any other aspect of the code you'd like to discuss!

You said:
Can you write me a small RE function that turns all repeats of " " into a single space?
ChatGPT said:

Certainly! In R, you can use regular expressions (regex) to replace multiple spaces with a single space. The gsub function is commonly used for such tasks. Here's a small function that does exactly that:

r
replace_multiple_spaces <- function(text) {
  # Replace one or more spaces with a single space
  return(gsub(" +", " ", text))
}

# Example usage
original_text <- "This   is  a   test    string."
cleaned_text <- replace_multiple_spaces(original_text)
print(cleaned_text)  # "This is a test string."

This function uses the regular expression " +" to match one or more spaces and replaces them with a single space. You can use this function to clean any string that may have irregular spacing.
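For reference, the equivalent one-liner in Python uses re.sub with the same pattern; note that in either language the pattern \s+ would additionally collapse tabs and newlines, if that is desired:

```python
import re

def replace_multiple_spaces(text):
    # Collapse runs of spaces into a single space
    return re.sub(" +", " ", text)

print(replace_multiple_spaces("This   is  a   test."))  # "This is a test."
```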